Using gamma regression for photometric redshifts of survey galaxies
Abstract Machine learning techniques offer a plethora of opportunities for tackling
big data within the astronomical community. We present Generalized Linear Models
(GLMs), a set of tools not commonly applied within astronomy despite being widely
used in other fields, as a fast alternative for determining photometric redshifts of
galaxies. With this technique, we achieve catastrophic outlier rates of the order of
~1%, in a matter of seconds, on large datasets of size ~1,000,000. To make these
techniques easily accessible to the astronomical community, we developed a set of
libraries and tools that are publicly available.


1 Introduction 

Generalized Linear Models (GLMs) are widely used throughout other scientific
disciplines, such as biology, medicine, and economics, and are available
within the overwhelming majority of contemporary statistical software packages.
However, they have been used very little within the astronomical community.

There are plenty of opportunities to apply GLMs within astronomy, and one
particularly important problem is the estimation of photometric redshifts (photo-z).

A galaxy's spectrum is shaped by many of its physical properties, including
morphology, age, metallicity, star formation history, merging history, and a host of
other confounding factors in addition to its redshift. This makes robust estimation
of photo-zs a difficult task. Estimation is usually done in one of two ways: by
template fitting, or by using machine learning techniques.

Several studies have investigated the advantages of the publicly available codes
that estimate the photo-z of galaxies, which span a wide diversity of methods. The
overall performance of most codes is good, with catastrophic error rates of 5 to 9%,
which is considered reliable within the field. There is also a growing number of
techniques that implement a hybrid of template fitting and machine learning.

Despite the current advancement within this field, large practical difficulties
remain. In the coming years a large number of surveys will begin producing big data
catalogues, e.g., the Large Synoptic Survey Telescope (LSST), Euclid, or the
Wide-Field Infrared Survey Telescope (WFIRST). Current techniques will become
difficult to employ if they require large training sets, and this warrants fast and
reliable photo-z methods that are capable of robustly estimating redshifts quickly
on large datasets.

We introduce GLMs as a new technique to quickly and robustly estimate galaxy
photo-zs. We show that it can run in a matter of seconds on a single-core computer,
even for millions of objects. As part of the COsmostatistics INitiative (COIN)
collaboration, we created and distributed easy-to-use software and web applications
for the wider community to use in estimating photo-zs.

2 Methodology 

We will not go into the details of deriving GLMs or the formulae used in our
technique; we instead refer the reader to Elliott et al. and references therein.
However, we note an important property of GLMs: they allow the user to choose the
type of distribution to model. GLMs are applicable to the entire exponential family
of distributions: Gaussian/normal, gamma, inverse Gaussian, Bernoulli, binomial,
Poisson, and negative binomial. For example, in this study we want to predict the
photo-zs of galaxies from multi-wavelength photometry. For such a study the gamma
distribution is favourable because a redshift, like a gamma-distributed variable,
is positive and continuous. To use the gamma distribution to predict redshifts, we
utilised the following machine learning methodology:

1. The data was randomly split into training and test sets.

2. Robust principal component analysis (e.g., Candes et al.; de Souza et al.) was
carried out on the complete data set.

3. A GLM with a gamma family distribution was fit to the training set, reflecting
the fact that measured redshifts are positive and continuous.

4. The predicted photo-z for the test data was calculated using the principal
component projections of the test data set and the best-fit GLM obtained from the
training sample.

5. To measure how well the photo-zs were estimated, we employed a metric commonly
used in the literature, specifically the catastrophic error.


LSST: http://www.lsst.org/lsst
Euclid: http://sci.esa.int/euclid
WFIRST: http://wfirst.gsfc.nasa.gov
COIN: https://asaip.psu.edu/organizations/iaa/iaa-working-group-of-cosmostatistics
COIN software: https://github.com/COINtoolbox
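The five steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the released COIN code. The log link, the iteratively reweighted least squares (IRLS) loop, the 0.15 threshold on |z_phot - z_spec| / (1 + z_spec) used to flag catastrophic errors, the synthetic data, and all function names are assumptions made for the example. The PCA step is omitted, and two random features stand in for the principal component projections.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_gamma_glm(features, y, max_iter=50, tol=1e-10):
    """Gamma GLM with an (assumed) log link, fit by IRLS.

    For the gamma family with a log link the IRLS weights are all 1, so each
    iteration is an ordinary least-squares fit to a working response.
    Returns [intercept, coefficient_1, ...].
    """
    rows = [[1.0] + list(f) for f in features]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    eta = [math.log(v) for v in y]  # start from mu = y
    beta = [0.0] * p
    for _ in range(max_iter):
        mu = [math.exp(e) for e in eta]
        work = [e + (yi - mi) / mi for e, yi, mi in zip(eta, y, mu)]
        xtz = [sum(r[i] * w for r, w in zip(rows, work)) for i in range(p)]
        new_beta = solve_linear_system(xtx, xtz)
        change = max(abs(nb - ob) for nb, ob in zip(new_beta, beta))
        beta = new_beta
        eta = [sum(b * x for b, x in zip(beta, r)) for r in rows]
        if change < tol:
            break
    return beta


def predict_gamma_glm(beta, features):
    """Predicted mean (the photo-z estimate) for each feature row."""
    return [math.exp(beta[0] + sum(b * x for b, x in zip(beta[1:], f)))
            for f in features]


def catastrophic_fraction(z_phot, z_spec, threshold=0.15):
    """Fraction of objects with |z_phot - z_spec| / (1 + z_spec) > threshold."""
    n_bad = sum(abs(zp - zs) / (1.0 + zs) > threshold
                for zp, zs in zip(z_phot, z_spec))
    return n_bad / len(z_spec)


# Synthetic stand-in data: gamma-distributed "redshifts" with a log-linear mean.
rng = random.Random(42)
true_beta = [-1.0, 0.8, -0.3]   # hypothetical coefficients
shape = 50.0                    # hypothetical gamma shape parameter
features, z_spec = [], []
for _ in range(3000):
    f = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    mean = math.exp(true_beta[0] + true_beta[1] * f[0] + true_beta[2] * f[1])
    features.append(f)
    z_spec.append(rng.gammavariate(shape, mean / shape))

# Step 1: random split. Steps 3 to 5: fit on training data, predict, score.
indices = list(range(len(z_spec)))
rng.shuffle(indices)
train, test = indices[:2000], indices[2000:]
beta_hat = fit_gamma_glm([features[i] for i in train], [z_spec[i] for i in train])
z_phot = predict_gamma_glm(beta_hat, [features[i] for i in test])
outlier_rate = catastrophic_fraction(z_phot, [z_spec[i] for i in test])
print("fitted coefficients:", [round(b, 3) for b in beta_hat])
print("catastrophic error fraction:", round(outlier_rate, 4))
```

With a log link the gamma-family IRLS weights are constant, which is why each iteration reduces to an unweighted least-squares solve. A different link, such as the canonical inverse link, would need a weighted solve instead.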


3 Data Samples 

We used two publicly available galaxy datasets to test the technique outlined in
the previous section. The first was from the PHoto-z Accuracy Testing (PHAT)
programme, an international initiative to identify the most promising photo-z
methods. We used their publicly available simulated dataset, which contains 169,520
simulated galaxies with redshifts ranging from z = 0.02 to 2.24, and magnitudes in
11 filters (u, g, r, i, z, Y, J, H, K, IRAC1, and IRAC2).

Given that this dataset was purely synthetic, we also used a real dataset acquired
from the Sloan Digital Sky Survey (SDSS). For details on the query see Elliott
et al. The sample used contained 1,347,640 galaxies with a redshift range of
z = 0 to 1.0, with magnitudes in 5 filters (u', g', r', i', and z'). Comparisons with
dereddened values showed no inconsistencies.


4 Results 

Both data sets were fit using an AMD Athlon X2 Dual-Core QL-64 processor with
1.7 GB RAM on the Ubuntu 10.04 operating system, which represents an old laptop
by today's standards. We achieved catastrophic errors of 1.4% for the PHAT0 data set
and 8% for the SDSS data set, within ≈ 1200 and 10 seconds. Lower catastrophic
errors of ~1% could be achieved by using more principal components, but at the cost
of a longer computation time, ≈ 5000 s. We plot the best-fit GLM models for the
SDSS datasets in Fig. 1.
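The trade-off noted above depends on how many principal components are retained. As a rough illustration only, the sketch below runs ordinary PCA (power iteration with deflation on the covariance matrix), not the robust PCA used in this work, on synthetic five-band magnitudes. It reports the fraction of variance captured by each component. The simulated data, the function names, and the choice of two components are all assumptions made for the example.

```python
import math
import random


def covariance(rows):
    """Return (covariance matrix, column means) for equal-length rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    return cov, means


def top_eigenpairs(matrix, n_components, n_iter=200):
    """Leading (eigenvalue, eigenvector) pairs of a symmetric matrix,
    found by power iteration with deflation."""
    p = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _component in range(n_components):
        vec = [1.0 + 0.1 * j for j in range(p)]  # arbitrary non-zero start
        for _step in range(n_iter):
            nxt = [sum(work[i][j] * vec[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(v * v for v in nxt))
            vec = [v / norm for v in nxt]
        value = sum(vec[i] * work[i][j] * vec[j]
                    for i in range(p) for j in range(p))
        pairs.append((value, vec))
        # Deflate: remove the component just found before searching for the next.
        for i in range(p):
            for j in range(p):
                work[i][j] -= value * vec[i] * vec[j]
    return pairs


def project(rows, means, components):
    """Project mean-centred rows onto the given component vectors."""
    return [[sum((r[j] - means[j]) * comp[j] for j in range(len(means)))
             for comp in components] for r in rows]


# Synthetic magnitudes in five bands: a shared brightness term, a colour term
# that varies linearly across the bands, and small independent noise.
rng = random.Random(0)
slopes = [-1.0, -0.5, 0.0, 0.5, 1.0]
magnitudes = []
for _ in range(500):
    brightness = rng.gauss(0.0, 1.0)
    colour = rng.gauss(0.0, 0.3)
    magnitudes.append([20.0 + brightness + s * colour + rng.gauss(0.0, 0.05)
                       for s in slopes])

cov, means = covariance(magnitudes)
pairs = top_eigenpairs(cov, n_components=2)
total_variance = sum(cov[i][i] for i in range(len(cov)))
fractions = [value / total_variance for value, _vec in pairs]
projections = project(magnitudes, means, [vec for _value, vec in pairs])
print("variance fraction per component:", [round(f, 3) for f in fractions])
```

The projections returned here play the role of the GLM inputs in step 4 of the methodology. Each extra component adds one more column to the design matrix, which is one reason the fit takes longer as more components are kept.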

5 Conclusions 

The astronomical community has left Generalized Linear Models relatively
untouched, despite their use throughout the academic world. We have demonstrated
their ease of use and quick applicability to estimating the photo-zs of galaxies.
This technique has been shown to be competitive with currently implemented
techniques, which can require larger training sets and longer times for their
algorithms to learn. Such properties will become important in the near future, when
upcoming wide-field sky surveys, such as the LSST, start collecting data at enormous
daily rates.

To make GLMs more accessible to the community, we developed a set of software
libraries written in Python and R that can be easily incorporated into users' own
work. A web application is also available and can be used instantly, without the
need for installation.
Fig. 1 The 2D probability density of the predicted redshift from the GLM fit vs. the spectroscopic
redshift (central plots). The upper and right subplots in each panel depict the redshift distributions
along photo-z and z_spec, respectively.